Automated detection of acoustic signals is crucial for effective monitoring of vocal animals and their habitats across large spatial and temporal scales. Recent advances in deep learning have made high-performance automated detection approaches accessible to more practitioners. However, few deep learning approaches can be implemented natively in R. The ‘torch for R’ ecosystem has made convolutional neural networks (CNNs) accessible to R users. Here, we provide an R package and workflow to use CNNs for automated detection and classification of acoustic signals from passive acoustic monitoring data. We provide examples using data collected in Sabah, Malaysia. The package provides functions to create spectrogram images from labeled data, compare the performance of different CNN architectures, deploy trained models over directories of sound files, and extract embeddings from trained models. R remains one of the most commonly used programming languages among ecologists, and we hope that this package makes deep learning approaches more accessible to this audience. In addition, these models can serve as important benchmarks for future automated detection work.
We are in a biodiversity crisis, and there is a great need to rapidly assess biodiversity in order to understand and mitigate anthropogenic impacts. One approach that can be especially effective for monitoring sound-producing yet cryptic animals is passive acoustic monitoring (PAM; Gibb et al. 2018), a technique that relies on autonomous acoustic recording units. PAM allows researchers to monitor acoustically active animals and their habitats at temporal and spatial scales that are impossible to achieve using only human observers. Interest in the use of PAM in terrestrial environments has increased substantially in recent years (Sugai et al. 2019), due to the reduced price of autonomous recording units and improved battery life and data storage capabilities. However, the use of PAM often leads to the collection of terabytes of data that are time- and cost-prohibitive to analyze manually.
Automated detection for PAM data refers to identifying the start and stop time of signals of interest within a longer sound recording (Stowell 2022). Early non-deep learning approaches for the automated detection of acoustic signals in terrestrial PAM data include binary point matching (Katz, Hafner, and Donovan 2016), spectrogram cross-correlation (Balantic and Donovan 2020), and the use of a band-limited energy detector with a subsequent classifier, such as a support vector machine (Clink et al. 2023; Kalan et al. 2015). Recent advances in deep learning have revolutionized image and speech recognition (LeCun, Bengio, and Hinton 2015), with important cross-over for the analysis of PAM data. Traditional approaches to machine learning relied heavily on feature engineering, since early machine learning algorithms required a reduced set of representative features that were manually chosen by researchers, such as features estimated from the spectrogram.
Deep learning does not require feature engineering (Stevens, Antiga, and Viehmann 2020), as the algorithms include a step that identifies relevant features from the input. This can lead to faster development time and an increased ability to represent the complex patterns typically seen in image and acoustic data. Convolutional neural networks (CNNs), one of the most widely used deep learning algorithms, are useful for processing data that have a ‘grid-like topology’, such as image data that can be considered a 2-dimensional grid of pixels (Goodfellow, Bengio, and Courville 2016). The ‘convolutional’ layers learn feature representations of the inputs; each layer consists of a set of filters, which are two-dimensional matrices of numbers, and the primary parameter of a layer is its number of filters (Gu et al. 2018). If training data are scarce, overfitting may occur, as representations of images tend to be large with many variables (LeCun, Bengio, and others 1995).
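To make the idea of a convolutional filter concrete, the following minimal sketch slides a single 3 x 3 filter over a small matrix using base R only. The helper `conv2d_valid`, the toy image, and the hand-made filter are illustrative assumptions and are not part of ‘gibbonNetR’; in a CNN the filter values are learned from the training data rather than set by hand. As in most deep learning libraries, the operation shown is a cross-correlation, meaning the filter is not flipped.

```r
# Slide one filter over a 2-D input without padding ("valid" mode, stride 1).
conv2d_valid <- function(input, kernel) {
  out_rows <- nrow(input) - nrow(kernel) + 1
  out_cols <- ncol(input) - ncol(kernel) + 1
  out <- matrix(0, nrow = out_rows, ncol = out_cols)
  for (i in seq_len(out_rows)) {
    for (j in seq_len(out_cols)) {
      patch <- input[i:(i + nrow(kernel) - 1), j:(j + ncol(kernel) - 1)]
      out[i, j] <- sum(patch * kernel)  # element-wise product, then sum
    }
  }
  out
}

# Toy "image": a 5 x 5 grid with a bright vertical band in column 3.
img <- matrix(0, nrow = 5, ncol = 5)
img[, 3] <- 1

# Hand-made filter (filled column by column) whose rows are all (-1, 0, 1);
# it responds positively to dark-to-bright transitions from left to right.
edge_filter <- matrix(c(-1, -1, -1,
                         0,  0,  0,
                         1,  1,  1), nrow = 3)

feature_map <- conv2d_valid(img, edge_filter)
print(feature_map)
```

The resulting feature map is positive where the band enters the filter window from the right, zero where the window is centered on the band, and negative where the band leaves on the left, which is the kind of local pattern a learned filter can pick out of a spectrogram image.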
Training deep learning models generally requires a large amount of training data and substantial computing resources. Transfer learning is an approach wherein the architecture of a pretrained CNN (which is generally trained on a very large dataset) is applied to a new classification problem. For example, CNNs such as ResNet that were trained on the ImageNet dataset of more than 1 million images (Deng et al. 2009) have been applied to automated detection and classification of primate and bird species from PAM data (Dufourq et al. 2022; Ruan et al. 2022). Very few practitioners train a CNN from scratch, and there are two common approaches to transfer learning. The first option is to use the CNN as a feature extractor and train only the last classification layer. The second option is known as ‘fine-tuning’, where instead of initializing a neural network with random weights, the initialization is done using the pretrained network. Using these pretrained weights is valuable because the model has already learned useful feature representations (Takhirov 2021). Both approaches require substantially less computing power than training from scratch. The functions in the ‘gibbonNetR’ package allow users to train models using both types of transfer learning.
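The difference between the two transfer learning options can be illustrated with a toy two-layer network written in base R. This is a conceptual sketch only: it does not use ‘torch’ or ‘gibbonNetR’, the function `train_toy` and all data are hypothetical, and the random matrix `W1_pre` merely stands in for pretrained feature-extraction weights. With `freeze_features = TRUE` only the classification layer is updated (the feature extractor option); with `freeze_features = FALSE` both layers start from the "pretrained" values and are updated (fine-tuning).

```r
sigmoid <- function(z) 1 / (1 + exp(-z))

# Gradient descent on the log-loss of a tiny two-layer network.
# W1: feature-extraction weights; w2: classification-layer weights.
train_toy <- function(X, y, W1, w2, freeze_features, steps = 200, lr = 0.5) {
  n <- nrow(X)
  for (s in seq_len(steps)) {
    H <- tanh(X %*% W1)        # extracted features
    p <- sigmoid(H %*% w2)     # predicted probability of the positive class
    err <- p - y               # gradient of the log-loss w.r.t. the logit
    grad_w2 <- t(H) %*% err / n
    if (!freeze_features) {
      # Back-propagate through tanh; skipped when the extractor is frozen.
      grad_W1 <- t(X) %*% ((err %*% t(w2)) * (1 - H^2)) / n
      W1 <- W1 - lr * grad_W1
    }
    w2 <- w2 - lr * grad_w2
  }
  list(W1 = W1, w2 = w2)
}

set.seed(1)
X <- matrix(rnorm(100 * 4), ncol = 4)     # 100 toy samples, 4 input variables
y <- as.numeric(X[, 1] + X[, 2] > 0)      # toy binary labels
W1_pre <- matrix(rnorm(4 * 3), nrow = 4)  # stand-in for pretrained weights
w2_init <- matrix(0, nrow = 3, ncol = 1)  # new, untrained classification layer

frozen <- train_toy(X, y, W1_pre, w2_init, freeze_features = TRUE)
tuned  <- train_toy(X, y, W1_pre, w2_init, freeze_features = FALSE)

# The frozen run leaves the feature-extraction weights untouched.
print(identical(frozen$W1, W1_pre))
print(identical(tuned$W1, W1_pre))
```

Freezing means fewer parameters are updated at each step, which is one reason the feature extractor option is cheaper to train, whereas fine-tuning lets the feature representations adapt to the new data.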
The two most popular open-source programming languages are R and Python (Scavetta and Angelov 2021). Python has surpassed R in terms of overall popularity, but R remains an important language for the life sciences (Lawlor et al. 2022). ‘Keras’ (Chollet and others 2015), ‘PyTorch’ (Paszke et al. 2019) and ‘Tensorflow’ (Martín Abadi et al. 2015) are some of the more popular neural network libraries; these libraries were all initially developed for the Python programming language. One of the earliest implementations of automated detection using R was the ‘monitoR’ package, which included functions for template detection (Katz, Hafner, and Donovan 2016). The ‘warbleR’ package included functions for energy-based detection, which identifies signals of interest in a certain frequency range above specified energy thresholds (Araya-Salas and Smith-Vidaurre 2017). The ‘gibbonR’ package combined energy-based detection with traditional machine learning classification (Clink and Klinck 2019).
Until recently, deep learning implementations in R relied on the ‘reticulate’ package, which served as an interface to Python (Ushey, Allaire, and Tang 2022). Early implementations of automated detection using deep learning in R also relied on the ‘reticulate’ package (Silva et al. 2022). However, the recent release of the ‘torch for R’ ecosystem provides a framework based on ‘PyTorch’ that runs natively in R and has no dependency on Python (Falbel 2023). Running natively in R means more straightforward installation and higher accessibility for users of the R programming environment. Keydana (2023) provides tutorials for image and audio classification in the ‘torch for R’ ecosystem, and the functionality in ‘gibbonNetR’ relies heavily on these tutorials. Variations of the transfer learning approaches included in this package have already been implemented in Python (Dufourq et al. 2022). Recent advances have used embeddings from audio classification models trained on bird songs for new classification problems, and in many cases these embeddings led to better performance than embeddings from models trained on general audio or image datasets (Ghani et al. 2023).
This package provides functions to create spectrogram images using the ‘seewave’ package (J. Sueur, T. Aubin, and C. Simonis 2008), and to train and deploy six CNN architectures trained on the ImageNet dataset (Deng et al. 2009): AlexNet (Krizhevsky, Sutskever, and Hinton 2017), VGG16, VGG19 (Simonyan and Zisserman 2014), ResNet18, ResNet50, and ResNet152 (He et al. 2016). This package has been used for automated detection of gunshots (Vu et al. 2024) and the calls of two gibbon species (Clink, Kim, et al. 2024; Clink, Cross-Jaya, et al. 2024). The package also has functions to evaluate model performance, deploy the highest performing model over a directory of sound files, and extract embeddings from trained models to visualize acoustic data. We provide an example dataset that consists of labelled vocalizations of the loud calls of four vertebrates (see detailed description below) from Danum Valley Conservation Area, Sabah, Malaysia (Clink and Hamid Ahmad 2024). Detailed usage instructions for ‘gibbonNetR’ can be found in the package documentation.
We include sound files and spectrogram images of five sound classes: great argus pheasant (Argusianus argus) long calls (Clink et al. 2021), helmeted hornbills (Rhinoplax vigil) and rhinoceros hornbills (Buceros rhinoceros) (Kennedy et al. 2023), female gibbons (Hylobates funereus), and a catch-all “noise” category. The data come from two separate PAM arrays in Danum Valley Conservation Area, Sabah, Malaysia. The training and validation data come from a wide array of Swift autonomous recording units placed on ~750 m spacing (Clink et al. 2023), and the test data come from a different, smaller array (~250 m spacing) within the same area. We used a band-limited energy detector to identify signals that were 3 s or longer in duration within the 400-1600 Hz range, and then a single observer (DJC) manually sorted the detections into their respective categories (Clink et al. 2023).
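The logic of a band-limited energy detector with a minimum duration can be sketched in a few lines of base R. This is an illustration under stated assumptions, not the detector used to build the dataset: the function `detect_events`, the simulated energy values, the 0.5 s frame length, and the -45 dB threshold are all hypothetical, and the sketch assumes the energy within the 400-1600 Hz band has already been computed for each time frame.

```r
# Return start and end times (in seconds) of runs where the band energy
# exceeds a threshold for at least `min_duration` seconds.
detect_events <- function(band_energy, frame_dur, threshold, min_duration = 3) {
  runs <- rle(band_energy > threshold)       # consecutive above/below runs
  run_end <- cumsum(runs$lengths)            # last frame index of each run
  run_start <- run_end - runs$lengths + 1    # first frame index of each run
  keep <- runs$values & (runs$lengths * frame_dur >= min_duration)
  data.frame(
    start = (run_start[keep] - 1) * frame_dur,
    end = run_end[keep] * frame_dur
  )
}

# Simulated energy (dB) in the 400-1600 Hz band, one value per 0.5 s frame:
# a 4 s signal followed by a 1 s signal, separated by quiet frames.
band_energy <- c(rep(-60, 10), rep(-30, 8), rep(-60, 6), rep(-30, 2), rep(-60, 4))

events <- detect_events(band_energy, frame_dur = 0.5, threshold = -45)
print(events)
```

Only the 4 s signal is returned, because the 1 s signal falls below the 3 s minimum duration. In practice the detections from such a step would still need to be sorted into classes, as was done manually for the example dataset.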
The package currently uses spectrogram images (Figure 1) to train and evaluate CNN model performance, and we include a function that can be used to create spectrogram images from Waveform Audio File Format (.wav) files. The .wav files should be organized into separate folders, with each folder named according to the class label of the files it contains. We highly recommend that your test data come from a different recording time and/or location to better understand the generalizability of the models (Stowell 2022).
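The folder layout described above can be checked before creating spectrograms. The following base R sketch builds a throwaway example layout in a temporary directory and recovers each file's class label from the name of its parent folder. The root folder name `training_clips` and the file names are hypothetical, and the files created here are empty placeholders rather than real recordings.

```r
# Build an example layout: one folder per class, each holding .wav files.
root <- file.path(tempdir(), "training_clips")
classes <- c("female.gibbon", "noise")
for (cl in classes) {
  dir.create(file.path(root, cl), recursive = TRUE, showWarnings = FALSE)
  file.create(file.path(root, cl, paste0("clip_", 1:2, ".wav")))
}

# Recover the class label of every .wav file from its parent folder name.
wav_files <- list.files(root, pattern = "\\.wav$", recursive = TRUE,
                        full.names = TRUE)
labels <- basename(dirname(wav_files))

# Number of clips per class; a quick check for missing or misnamed folders.
print(table(labels))
```

Tabulating the labels in this way makes class imbalance or a misspelled folder name visible before any model is trained.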
The package currently allows for the training of six different CNN architectures (‘alexnet’, ‘vgg16’, ‘vgg19’, ‘resnet18’, ‘resnet50’, or ‘resnet152’), and the user can specify whether or not to freeze the feature extraction layers. There is also the option to train a binary or multi-class classifier.
We can compare the performance of different CNN architectures (Figure 2). Using the ‘get_best_performance’ function, we can evaluate the performance of different model architectures on the test dataset for the specified class. We can calculate the best F1, precision, and recall using the ‘caret’ package (Kuhn 2008), and the area under the Receiver Operating Characteristic (ROC) curve using the ‘ROCR’ package (Sing et al. 2005). The area under the ROC curve is a threshold-independent metric that evaluates the classifier’s ability to discriminate between positive and negative classes.
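library(gibbonNetR)
# performancetables.dir must already hold the path to the folder of saved performance tables
PerformanceOutput <- get_best_performance(
  performancetables.dir = performancetables.dir,
  class = "female.gibbon",
  model.type = "multi",
  Thresh.val = 0
)
# Display the F1 plot returned by the function
PerformanceOutput$f1_plot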
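For readers who want to see what these metrics measure, the sketch below computes precision, recall, and F1 at several confidence thresholds, and a rank-based area under the ROC curve, using base R only. It is not how ‘caret’, ‘ROCR’, or ‘get_best_performance’ compute them internally; the function `metrics_at`, the scores, and the labels are hypothetical.

```r
# Precision, recall and F1 for one confidence threshold.
metrics_at <- function(scores, truth, threshold) {
  pred <- scores >= threshold
  tp <- sum(pred & truth)    # true positives
  fp <- sum(pred & !truth)   # false positives
  fn <- sum(!pred & truth)   # false negatives
  precision <- if (tp + fp > 0) tp / (tp + fp) else NA_real_
  recall <- if (tp + fn > 0) tp / (tp + fn) else NA_real_
  f1 <- if (is.na(precision) || is.na(recall) || precision + recall == 0) {
    NA_real_
  } else {
    2 * precision * recall / (precision + recall)
  }
  data.frame(threshold = threshold, precision = precision,
             recall = recall, f1 = f1)
}

# Hypothetical model confidence scores and true labels for the target class.
scores <- c(0.95, 0.90, 0.80, 0.60, 0.40, 0.30, 0.20, 0.10)
truth <- c(TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE)

thresholds <- c(0.25, 0.50, 0.75)
results <- do.call(rbind, lapply(thresholds, metrics_at,
                                 scores = scores, truth = truth))
best <- results[which.max(results$f1), ]
print(results)
print(best)

# Threshold-independent AUC via the rank-sum (Mann-Whitney) formulation:
# the proportion of positive-negative pairs ranked in the correct order.
n_pos <- sum(truth)
n_neg <- sum(!truth)
auc <- (sum(rank(scores)[truth]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
print(auc)
```

Precision, recall, and F1 change with the chosen threshold, which is why the best value is reported, whereas the AUC summarizes ranking quality across all thresholds at once.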
Embeddings from deep learning models can be used as features in unsupervised approaches, with promising results for call repertoires (Best et al. 2023) and individual identity (Lakdari et al. 2024). This package contains a function to use pretrained CNNs to extract embeddings, where the trained model path, along with test data location and target class are specified. Depending on the research question, this output could be used to visualize true and false positives from automated detection, or to explore differences in call types or potential number of individuals in the dataset.
In Figure 3 the top plot is a Uniform Manifold Approximation and Projection (UMAP) where each point represents one call, and the colors indicate the original class label. The bottom plot is the same UMAP plot, but with points colored based on cluster assignment by the ‘hdbscan’ algorithm (Hahsler, Piekenbrock, and Doran 2019).
We can calculate the Normalized Mutual Information score, which takes a value between 0 and 1 and indicates how well the cluster labels match the actual labels. We also create a confusion matrix using the ‘caret’ package (Kuhn 2008), in which the target class is matched to the cluster, as assigned by the unsupervised clustering algorithm ‘hdbscan’ (Hahsler, Piekenbrock, and Doran 2019), that contains the largest number of observations of that class.
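Both steps can be sketched in base R. Several normalizations of mutual information are in use; the version below divides by the arithmetic mean of the two entropies, which is an assumption and may differ from the normalization used in the package. The functions `entropy` and `nmi`, the labels, and the cluster assignments are hypothetical.

```r
# Shannon entropy of a probability vector (zero entries are skipped).
entropy <- function(p) {
  p <- p[p > 0]
  -sum(p * log(p))
}

# NMI with arithmetic-mean normalization: 2 * I(X;Y) / (H(X) + H(Y)).
# Undefined (NaN) when both labelings contain a single group.
nmi <- function(labels, clusters) {
  joint <- table(labels, clusters) / length(labels)  # joint probabilities
  h_labels <- entropy(rowSums(joint))
  h_clusters <- entropy(colSums(joint))
  mutual_info <- h_labels + h_clusters - entropy(as.vector(joint))
  2 * mutual_info / (h_labels + h_clusters)
}

labels <- c(rep("female.gibbon", 4), rep("noise", 4))
clusters_perfect <- c(1, 1, 1, 1, 2, 2, 2, 2)  # clusters mirror the labels
clusters_mixed   <- c(1, 1, 1, 2, 2, 2, 2, 1)  # one call of each class swapped

print(nmi(labels, clusters_perfect))
print(nmi(labels, clusters_mixed))

# Match the target class to the cluster holding most of its observations.
tab <- table(labels, clusters_mixed)
target_cluster <- colnames(tab)[which.max(tab["female.gibbon", ])]
print(target_cluster)
```

Once the target class has been matched to a cluster, cluster membership can be treated as a predicted label and compared against the true labels in a confusion matrix.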
There have been major advances in the fields of deep learning and automated detection for PAM data in recent years. The approach presented in this package is one of the first to use the ‘torch for R’ ecosystem and to employ automated detection using deep learning natively in R. More recent approaches that use models explicitly trained on bioacoustics data, such as BirdNET (Ghani et al. 2023), have since been introduced. There is a pressing need in the field of bioacoustics for benchmarking, wherein the performance of different model architectures is compared across diverse datasets. The methods presented here can provide important benchmarks for future work, and for understanding whether and how deep learning advances improve performance over more traditional methods. In addition, this package provides a suite of tools for processing, analyzing, and visualizing acoustic data, supporting tasks such as automated detection, feature extraction, classification, and data visualization, which are critical for conservation work using PAM. The R package is available on GitHub, where issues can be opened.
The research presented here adhered to all local and international laws. Institutional approval was provided by Cornell University (IACUC 2017–0098). Sabah Biodiversity Centre and the Danum Valley Management Committee provided permission for the collection of acoustic recordings.
We would like to thank the Sabah Biodiversity Centre and Danum Valley Conservation Area for granting us permission to conduct research. We are incredibly grateful for the detailed comments provided by Steffi LaZerte and Camille Desjonquères, which substantially improved the package and documentation.